Abstract

The CAP Theorem is a frequently cited impossibility
result in distributed systems, especially among NoSQL
distributed databases. In this paper we survey some
of the confusion about the meaning of CAP, including
inconsistencies and ambiguities in its definitions,
and we highlight some problems in its formalization.
CAP is often interpreted as proof that eventually
consistent databases have better availability properties than
strongly consistent databases; although there is some
truth in this, we show that more careful reasoning is
required. These problems cast doubt on the utility of
CAP as a tool for reasoning about trade-offs in practical
systems. As an alternative to CAP, we propose a
delay-sensitivity framework, which analyzes the sensitivity of
operation latency to network delay, and which may help
practitioners reason about the trade-offs between
consistency guarantees and tolerance of network faults.


1 Background 


Replicated databases maintain copies of the same data
on multiple nodes, potentially in disparate geographical
locations, in order to tolerate faults (failures of nodes or
communication links) and to provide lower latency to
users (requests can be served by a nearby site). However,
implementing reliable, fault-tolerant applications
in a distributed system is difficult; if there are multiple
copies of the data on different nodes, they may be
inconsistent with each other, and an application that is
not designed to handle such inconsistencies may produce
incorrect results.

In order to provide a simpler programming model
to application developers, the designers of distributed
data systems have explored various consistency guarantees
that can be implemented by the database infrastructure,
such as linearizability [30], sequential consistency [38],
causal consistency and pipelined RAM (PRAM) [42].
When multiple processes execute operations on a shared
storage abstraction such as a database, a consistency
model describes what values are allowed to be returned
by operations accessing the storage, depending on other
operations executed previously or concurrently, and the
return values of those operations.


Similar concerns arise in the design of multiprocessor
computers, which are not geographically distributed,
but nevertheless present inconsistent views of
memory to different threads, due to the various caches
and buffers employed by modern CPU architectures.
For example, x86 microprocessors provide a level of
consistency that is weaker than sequential, but stronger
than causal consistency [48]. However, in this paper we
focus our attention on distributed systems that must
tolerate partial failures and unreliable network links.


A strong consistency model like linearizability provides
an easy-to-understand guarantee: informally, all
operations behave as if they executed atomically on a
single copy of the data. However, this guarantee comes
at the cost of reduced performance and fault tolerance [22]
compared to weaker consistency models. In
particular, as we discuss in this paper, algorithms that
ensure stronger consistency properties among replicas
are more sensitive to message delays and faults in the
network. Many real computer networks are prone to
unbounded delays and lost messages [10], making the
fault tolerance of distributed consistency algorithms an
important issue in practice.


A network partition is a particular kind of communication
fault that splits the network into subsets of nodes
such that nodes in one subset cannot communicate with
nodes in another. As long as the partition exists, any
data modifications made in one subset of nodes cannot
be visible to nodes in another subset, since all messages
between them are lost. Thus, an algorithm that maintains
the illusion of a single copy may have to delay
operations until the partition is healed, to avoid the risk
of introducing inconsistent data in different subsets of
nodes.


This trade-off was already known in the 1970s [22, 23, 32],
but it was rediscovered in the early 2000s,
when the web’s growing commercial popularity made
geographic distribution and high availability important
to many organizations [18]. It was originally called
the CAP Principle by Fox and Brewer [24], where
CAP stands for Consistency, Availability and Partition
tolerance. After the principle was formalized by Gilbert
and Lynch [25, 26] it became known as the CAP Theorem.

CAP became an influential idea in the NoSQL movement,
and was adopted by distributed systems practitioners
to critique design decisions. It provoked
a lively debate about trade-offs in data systems, and
encouraged system designers to challenge the received
wisdom that strong consistency guarantees were essential
for databases.

The rest of this paper is organized as follows: in
section 2 we compare various definitions of consistency,
availability and partition tolerance. We then examine
the formalization of CAP by Gilbert and Lynch [25]
in section 3. Finally, in section 4 we discuss some
alternatives to CAP that are useful for reasoning about
trade-offs in distributed systems.


2 CAP Theorem Definitions 


CAP was originally presented in the form of “consistency,
availability, partition tolerance: pick any two”
(i.e. you can have CA, CP or AP, but not all three).
Subsequent debates concluded that this formulation is
misleading [28, 47], because the distinction between
CA and CP is unclear, as detailed later in this section.
Many authors now prefer the following formulation: if
there is no network partition, a system can be both
consistent and available; when a network partition occurs,
a system must choose between either consistency (CP)
or availability (AP).

Some authors define a CP system as one in
which a majority of nodes on one side of a partition
can continue operating normally, and a CA system as
one that may fail catastrophically under a network
partition (since it is designed on the assumption that
partitions are very rare). However, this definition is not
universally agreed, since it is counter-intuitive to label a
system as “available” if it fails catastrophically under a
partition, while a system that continues partially
operating in the same situation is labelled “unavailable” (see
section 2.3).

Disagreement about the definitions of terms like
availability is the source of many misunderstandings
about CAP, and unclear definitions lead to problems
with its formalization as a theorem. In sections 2.1 to
2.3 we survey various definitions that have been proposed.


2.1 Availability 

In practical engineering terms, availability usually
refers to the proportion of time during which a service
is able to successfully handle requests, or the proportion
of requests that receive a successful response. A
response is usually considered successful if it is valid
(not an error, and satisfies the database’s safety properties)
and it arrives at the client within some timeout,
which may be specified in a service level agreement
(SLA). Availability in this sense is a metric that is
empirically observed during a period of a service’s operation.
A service may be available (up) or unavailable
(down) at any given time, but it is nonsensical to say
that some software package or algorithm is ‘available’
or ‘unavailable’ in general, since the uptime percentage
is only known in retrospect, after a period of operation
(during which various faults may have occurred).
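
As a minimal sketch of availability as an observed metric, the
following Python computes the proportion of requests that received
a valid response within a timeout. The Request record, its field
names and the sample log are hypothetical and serve only as
illustration; a real service would derive them from its own
request logs and its own SLA.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool          # True if the response was valid (not an error)
    latency_s: float  # seconds until the response reached the client


def observed_availability(requests, timeout_s):
    """Fraction of requests answered validly within the timeout."""
    if not requests:
        raise ValueError("availability is undefined without observations")
    successes = sum(1 for r in requests if r.ok and r.latency_s <= timeout_s)
    return successes / len(requests)


# Hypothetical observation window of four requests.
log = [
    Request(ok=True, latency_s=0.05),
    Request(ok=True, latency_s=2.5),   # valid, but too slow for a 1 s timeout
    Request(ok=False, latency_s=0.01), # fast, but an error
    Request(ok=True, latency_s=0.2),
]
print(observed_availability(log, timeout_s=1.0))  # 0.5
```

The result depends on the chosen timeout and on the faults that
happened to occur during the window, which is why the metric
describes a period of operation rather than an algorithm.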

There is a long tradition of highly available and
fault-tolerant systems, whose algorithms are designed such
that the system can remain available (up) even when
some part of the system is faulty, thus increasing the
expected mean time to failure (MTTF) of the system as
a whole. Using such a system does not automatically
make a service 100% available, but it may increase the
observed availability during operation, compared to
using a system that is not fault-tolerant.


2.1.1 The A in CAP 


Does the A in CAP refer to a property of an algorithm,
or to an observed metric during system operation? This
distinction is unclear. Brewer does not offer a precise
definition of availability, but states that “availability is
obviously continuous from 0 to 100 percent”,
suggesting an observed metric. Fox and Brewer also use
the term yield to refer to the proportion of requests that
are completed successfully [24] (without specifying any
timeout).

On the other hand, Gilbert and Lynch write: “For
a distributed system to be continuously available,
every request received by a non-failing node in the
system must result in a response”.^1 In order to prove a
result about systems in general, this definition interprets
availability as a property of an algorithm, not as an
observed metric during system operation - i.e. they
define a system as being “available” or “unavailable”
statically, based on its algorithms, not its operational status
at some point in time.

^1 This sentence appears to define a property of continuous
availability, but the rest of the paper does not refer to this
“continuous” aspect.

One particular execution of the algorithm is available
if every request in that execution eventually receives
a response. Thus, an algorithm is “available”
under Gilbert and Lynch’s definition if all possible
executions of the algorithm are available. That is, the
algorithm must guarantee that requests always result in
responses, no matter what happens in the system (see
section 2.3.1).

Note that Gilbert and Lynch’s definition requires any
non-failed node to be able to generate valid responses,
even if that node is completely isolated from the other
nodes. This definition is at odds with Fox and Brewer’s
original proposal of CAP, which states that “data is
considered highly available if a given consumer of the data
can always reach some replica” [24, emphasis original].

Many so-called highly available or fault-tolerant
systems have very high uptime in practice, but are in fact
“unavailable” under Gilbert and Lynch’s definition:
for example, in a system with an elected leader or
primary node, if a client cannot reach the leader due
to a network fault, the client cannot perform any writes,
even though it may be able to reach another replica.


2.1.2 No maximum latency 

Note that Gilbert and Lynch’s definition of availability
does not specify any upper bound on operation latency;
it only requires requests to eventually return a response
within some unbounded but finite time. This is
convenient for proof purposes, but does not closely match our
intuitive notion of availability (in most situations, a
service that takes a week to respond might as well be
considered unavailable).

This definition of availability is a pure liveness
property, not a safety property: that is, at any point in
time, if the response has not yet arrived, there is still
hope that the availability property might still be
fulfilled, because the response may yet arrive - it is never
too late. This aspect of the definition will be important
in section 3 when we examine Gilbert and Lynch’s
proofs in more detail.

(In section 4 we will discuss a definition of availability
that takes latency into account.)


2.1.3 Failed nodes 

Another noteworthy aspect of Gilbert and Lynch’s
definition of availability is the proviso of applying only to
non-failed nodes. This allows the aforementioned
definition of a CA system as one that fails catastrophically
if a network partition occurs: if the partition causes all
nodes to fail, then the availability requirement does not
apply to any nodes, and thus it is trivially satisfied, even
if no node is able to respond to any requests. This
definition is logically sound, but somewhat counter-intuitive.


2.2 Consistency 

Consistency is also an overloaded word in data
systems: consistency in the sense of ACID is a very
different property from consistency in CAP. In the
distributed systems literature, consistency is usually
understood as not one particular property, but as a spectrum of
models with varying strengths of guarantee. Examples
of such consistency models include linearizability [30],
sequential consistency [38], causal consistency and
PRAM [42].

There is some similarity between consistency models
and transaction isolation models such as
serializability, snapshot isolation, repeatable read and
read committed. Both describe restrictions on
the values that operations may return, depending on
other (prior or concurrent) operations. The difference is
that transaction isolation models are usually formalized
assuming a single replica, and operate at the granularity
of transactions (each transaction may read or write
multiple objects). Consistency models assume multiple
replicas, but are usually defined in terms of
single-object operations (not grouped into transactions). Bailis
et al. demonstrate a unified framework for reasoning
about both distributed consistency and transaction
isolation in terms of CAP.


2.2.1 The C in CAP 


Fox and Brewer [24] define the C in CAP as
one-copy serializability (1SR), whereas Gilbert and
Lynch [25] define it as linearizability. Those
definitions are not identical, but fairly similar.^2 Both are
safety properties, i.e. restrictions on the possible
executions of the system, ensuring that certain situations
never occur.

^2 Linearizability is a recency guarantee, whereas 1SR is not. 1SR
requires isolated execution of multi-object transactions, which
linearizability does not. Both require coordination, in a sense
made precise elsewhere in this paper.

In the case of linearizability, the situation that may
not occur is a stale read: stated informally, once a write
operation has completed or some read operation has
returned a new value, all following read operations must
return the new value, until it is overwritten by another
write operation. Gilbert and Lynch observe that if the
write and read operations occur on different nodes, and
those nodes cannot communicate during the time when
those operations are being executed, then the safety
property cannot be satisfied, because the read operation
cannot know about the value written.
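
The informal stale-read rule can be sketched as a small check over
a history of operations. This sketch assumes the simplest possible
case: a single register and operations that never overlap in time,
listed in real-time order. The function name and the history
encoding are hypothetical, and a general linearizability checker
would also have to consider concurrent operations.

```python
def find_stale_reads(history, initial_value):
    """Return the indexes of reads that did not return the latest write.

    history: list of ("write", value) or ("read", value) tuples in
    real-time order; operations are assumed not to overlap.
    """
    current = initial_value
    stale = []
    for index, (kind, value) in enumerate(history):
        if kind == "write":
            current = value          # the write has completed
        elif kind == "read" and value != current:
            stale.append(index)      # an older value was returned
    return stale


# A write of v2 completes, then one node still returns v1.
history = [("write", "v2"), ("read", "v1"), ("read", "v2")]
print(find_stale_reads(history, initial_value="v1"))  # [1]
```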

The C of CAP is sometimes referred to as strong
consistency (a term that is not formally defined), and
contrasted with eventual consistency, which is
often regarded as the weakest level of consistency that
is useful to applications. Eventual consistency means
that if a system stops accepting writes and sufficient^3
communication occurs, then eventually all replicas will
converge to the same value. However, as the
aforementioned list of consistency models indicates, it is overly
simplistic to cast ‘strong’ and eventual consistency as
the only possible choices.

2.2.2 Probabilistic consistency 

It is also possible to define consistency as a quantitative
metric rather than a safety property. For example, Fox
and Brewer [24] define harvest as “the fraction of the
data reflected in the response, i.e. the completeness of
the answer to the query,” and probabilistically bounded
staleness studies the probability of a read returning
a stale value, given various assumptions about the
distribution of network latencies. However, these stochastic
definitions of consistency are not the subject of CAP.

2.3 Partition Tolerance 

A network partition has long been defined as a
communication failure in which the network is split into
disjoint sub-networks, with no communication possible
across sub-networks [32]. This is a fairly narrow class
of fault, but it does occur in practice, so it is worth
studying.


^3 It is not clear what amount of communication is ‘sufficient’. A
possible formalization would be to require all replicas to converge to
the same value within finite time, assuming fair-loss links (see section
2.3.2).

2.3.1 Assumptions about system model 

It is less clear what partition tolerance means. Gilbert
and Lynch [25] define a system as partition-tolerant
if it continues to satisfy the consistency and availability
properties in the presence of a partition. Fox and
Brewer [24] define partition-resilience as “the system
as whole can survive a partition between data replicas”
(where survive is not defined).

At first glance, these definitions may seem redundant; 
if we say that an algorithm provides some guarantee 
(e.g. linearizability), then we expect all executions of 
the algorithm to satisfy that property, regardless of the 
faults that occur during the execution. 

However, we can clarify the definitions by observing
that the correctness of a distributed algorithm is always
subject to assumptions about the faults that may occur
during its execution. If you take an algorithm that
assumes fair-loss links and crash-stop processes, and
subject it to Byzantine faults, the execution will most likely
violate safety properties that were supposedly guaranteed.
These assumptions are typically encoded in a system
model, and non-Byzantine system models rule out
certain kinds of fault as impossible (so algorithms are
not expected to tolerate them).

Thus, we can interpret partition tolerance as meaning
“a network partition is among the faults that are
assumed to be possible in the system.” Note that this
definition of partition tolerance is a statement about the
system model, whereas consistency and availability are
properties of the possible executions of an algorithm. It
is misleading to say that an algorithm “provides partition
tolerance,” and it is better to say that an algorithm
“assumes that partitions may occur.”

If an algorithm assumes the absence of partitions, and
is nevertheless subjected to a partition, it may violate
its guarantees in arbitrarily undefined ways (including
failing to respond even after the partition is healed, or
deleting arbitrary amounts of data). Even though it may
seem that such arbitrary failure semantics are not very
useful, various systems exhibit such behavior in practice
[35, 36]. Making networks highly reliable is very
expensive, so most distributed programs must assume
that partitions will occur sooner or later.

2.3.2 Partitions and fair-loss links 

Further confusion arises due to the fact that network
partitions are only one of a wide range of faults that can
occur in distributed systems, including nodes failing or
restarting, nodes pausing for some amount of time (e.g.
due to garbage collection), and loss or delay of messages
in the network. Some faults can be modeled in
terms of other faults (for example, Gilbert and Lynch
state that the loss of an individual message can be
modeled as a short-lived network partition).


In the design of distributed systems algorithms,
a commonly assumed system model is fair-loss
links. A network link has the fair-loss property if
the probability of a message not being lost is non-zero,
i.e. the link sometimes delivers messages. The link may
have intervals of time during which all messages are
dropped, but those intervals must be of finite duration.
On a fair-loss link, message delivery can be made
reliable by retrying a message an unbounded number of
times: the message is guaranteed to be eventually
delivered after a finite number of attempts.
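
The retry argument can be illustrated with a small simulation. This
is only a sketch under a stronger assumption than fair loss: each
attempt is delivered independently with a fixed non-zero
probability. The class and function names are hypothetical, and the
sender is told directly whether an attempt succeeded, whereas a
real protocol would have to learn this from acknowledgments.

```python
import random


class FairLossLink:
    """Simulated link that delivers each attempt with probability p > 0."""

    def __init__(self, delivery_probability, rng):
        if not 0 < delivery_probability <= 1:
            raise ValueError("fair loss requires a non-zero delivery probability")
        self.delivery_probability = delivery_probability
        self.rng = rng

    def send(self, message):
        """Return True if this particular attempt was delivered."""
        return self.rng.random() < self.delivery_probability


def send_reliably(link, message):
    """Retry until delivered; return the number of attempts needed."""
    attempts = 0
    while True:
        attempts += 1
        if link.send(message):
            return attempts


# A seeded generator makes the simulated run repeatable.
lossy = FairLossLink(delivery_probability=0.1, rng=random.Random(42))
print(send_reliably(lossy, "hello"))
```

The number of attempts has no fixed upper bound, but it is finite
in every run, which mirrors the "unbounded but finite" wording used
throughout this section.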


We argue that fair-loss links are a good model of
most networks in practice: faults occur unpredictably;
messages are lost while the fault is occurring; the fault
lasts for some finite duration (perhaps seconds, perhaps
hours), and eventually it is healed (perhaps after human
intervention). There is no malicious actor in the network
who can cause systematic message loss over unlimited
periods of time - such malicious actors are usually only
assumed in the design of Byzantine fault tolerant
algorithms.

Is “partitions may occur” equivalent to assuming
fair-loss links? Gilbert and Lynch [25] define partitions as
“the network will be allowed to lose arbitrarily many
messages sent from one node to another.” In this
definition it is unclear whether the number of lost messages
is unbounded but finite, or whether it is potentially
infinite.

Partitions of a finite duration are possible with
fair-loss links, and thus an algorithm that is correct in a
system model of fair-loss links can tolerate partitions of a
finite duration. Partitions of an infinite duration require
some further thought, as we shall see in section 3.


3 The CAP Proofs 


In this section, we build upon the discussion of
definitions in the last section, and examine the proofs of the
theorems of Gilbert and Lynch [25]. We highlight some
ambiguities in the reasoning of the proofs, and then
suggest a more precise formalization.


3.1 Theorems 1 and 2 

Gilbert and Lynch’s Theorem 1 is stated as follows:

It is impossible in the asynchronous network model
to implement a read/write data object that guarantees
the following properties:

• Availability

• Atomic consistency^4

in all fair executions (including those in which
messages are lost).

Theorem 2 is similar, but specified in a system model
with bounded network delay. The discussion in this
section 3.1 applies to both theorems.

3.1.1 Availability of failed nodes

The first problem with this proof is the definition of
availability. As discussed in section 2.1.3, only
non-failing nodes are required to respond.

If it is possible for the algorithm to declare nodes as
failed (e.g. if a node may crash itself), then the
availability property can be trivially satisfied: all nodes can
be crashed, and thus no node is required to respond. Of
course, such an algorithm would not be useful in
practice. Alternatively, if a minority of nodes is permanently
partitioned from the majority, an algorithm could define
the nodes in the minority partition as failed (by
crashing them), while the majority partition continues
implementing a linearizable register.

This is not the intention of CAP - the raison d’être
of CAP is to characterize systems in which a minority
partition can continue operating independently of the
rest - but the present formalization of availability does
not exclude such trivial solutions.

3.1.2 Finite and infinite partitions 

Gilbert and Lynch’s proofs of theorems 1 and 2
construct an execution of an algorithm A in which a write
is followed by a read, while simultaneously a partition
exists in the network. By showing that the execution is
not linearizable, the authors derive a contradiction.

Note that this reasoning is only correct if we assume 
a system model in which partitions may have infinite 
duration. 

If the system model is based on fair-loss links, then
all partitions may be assumed to be of unbounded but
finite duration (section 2.3.2). Likewise, Gilbert and
Lynch’s availability property does not place any upper
bound on the duration of an operation, as long as it is
finite (section 2.1.2). Thus, if a linearizable algorithm
encounters a network partition in a fair-loss system model,
it is acceptable for the algorithm to simply wait for the
partition to be healed: at any point in time, there is still
hope that the partition will be healed in future, and so
the availability property may yet be satisfied. For
example, the ABD algorithm can be used to implement a
linearizable read-write register in an asynchronous
network with fair-loss links.^5

^4 In this context, atomic consistency is synonymous with
linearizability, and it is unrelated to the A in ACID.

On the other hand, in an execution where a partition
of infinite duration occurs, the algorithm is forced to
make a choice between waiting until the partition heals
(which never happens, thus violating availability) and
exhibiting the execution in the proof of Theorem 1 (thus
violating linearizability). We can conclude that
Theorem 1 is only valid in a system model where infinite
partitions are possible.


3.1.3 Linearizability vs. eventual consistency 

Note that in the case of an infinite partition, no
information can ever flow from one sub-network to the other.
Thus, even eventual consistency (replica convergence in
finite time, see section 2.2.1) is not possible in a system
with an infinite partition.

Theorem 1 demonstrated that in a system model with
infinite partitions, no algorithm exists which ensures
linearizability and availability in all executions.
However, we can also see that in the same system model, no
algorithm exists which ensures eventual consistency in
all executions.

The CAP theorem is often understood as demonstrating
that linearizability cannot be achieved with high
availability, whereas eventual consistency can.
However, the results so far do not differentiate between
linearizable and eventually consistent algorithms: both are
possible if partitions are always finite, and both are
impossible in a system model with infinite partitions.

To distinguish between linearizability and eventual
consistency, a more careful formalization of CAP is
required, which we give in section 3.2.

^5 ABD is an algorithm for a single-writer multi-reader
register. It was extended to the multi-writer case by Lynch and
Shvartsman.


3.2 The partitionable system model 

In this section we suggest a more precise formulation of
CAP, and derive a result similar to Gilbert and Lynch’s
Theorem 1 and Corollary 1.1. This formulation will
help us gain a better understanding of CAP and its
consequences.

3.2.1 Definitions 

Define a partitionable link to be a point-to-point link 
with the following properties: 

1. No duplication: If a process p sends a message m 
once to process q, then m is delivered at most once 
by q. 

2. No creation: If some process q delivers a message 
m with sender p, then m was previously sent to q 
by process p. 

(A partitionable link is allowed to drop an infinite number
of messages and cause unbounded message delay.)
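
The two link properties can be made concrete with a small sketch.
The class and method names below are hypothetical; the sketch
models one direction of a point-to-point link from p to q, where an
explicit drop_all call stands in for a partition and messages may
otherwise wait in the queue for an arbitrarily long time.

```python
class PartitionableLink:
    """One-directional link that may drop or delay messages."""

    def __init__(self):
        self._pending = []  # sent, but neither delivered nor dropped yet

    def send(self, message):
        """Called by the sender p; the message is queued once."""
        self._pending.append(message)

    def drop_all(self):
        """Model a partition: every pending message is lost."""
        self._pending.clear()

    def deliver_next(self):
        """Deliver one pending message to q, or None if nothing is pending."""
        if not self._pending:
            return None              # no creation: nothing unsent is delivered
        return self._pending.pop(0)  # no duplication: removed when delivered


link = PartitionableLink()
link.send("m1")
print(link.deliver_next())  # m1
print(link.deliver_next())  # None
```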

Define the partitionable model as a system model
in which processes can only communicate via
partitionable links, in which processes never crash,^6 and in
which every process has access to a local clock that is
able to generate timeouts (the clock progresses
monotonically at a rate approximately equal to real time, but
clocks of different processes are not synchronized).

Define an execution E as admissible in a system
model M if the processes and links in E satisfy the
properties defined by M.

Define an algorithm A as terminating in a system
model M if, for every execution E of A, if E is
admissible in M, then every operation in E terminates in finite
time.^7

Define an execution E as loss-free if for every
message m sent from p to q during E, m is eventually
delivered to q. (There is no delay bound on delivery.)

An execution E is partitioned if it is not loss-free.
Note: we may assume that links automatically resend
lost messages an unbounded number of times. Thus, an
execution in which messages are transiently lost during
some finite time period is not partitioned, because the
links will eventually deliver all messages that were lost.
An execution is only partitioned if the message loss
persists forever.

^6 The assumption that processes never crash is of course
unrealistic, but it makes the impossibility results in section 3.2.2
stronger. It also rules out the trivial solution of section 3.1.1.

^7 Our definition of terminating corresponds to Gilbert and Lynch’s
definition of available. We prefer to call it terminating because the
word available is widely understood as referring to an empirical
metric (see section 2.1). There is some similarity to wait-free data
structures, although these usually assume reliable communication and
unreliable processes.

For the definition of linearizability we refer to
Herlihy and Wing [30].

There is no generally agreed formalization of
eventual consistency, but the following corresponds to a
liveness property that has been proposed:
eventually, every read operation $\mathit{read}(q)$ at process q must
return a set of all the values v ever written by any
process p in an operation $\mathit{write}(p, v)$. For simplicity, we
assume that values are never removed from the read set,
although an application may only see one of the values
(e.g. the one with the highest timestamp).

More formally, an infinite execution E is eventually
consistent if, for all processes p and q, and for
every value v such that operation $\mathit{write}(p, v)$ occurs in E,
there are only finitely many operations in E such that
$v \notin \mathit{read}(q)$.

3.2.2 Impossibility results 

Assertion 1. If an algorithm A implements a
terminating read-write register R in the partitionable model,
then there exists a loss-free execution of A in which R is
not linearizable.

Proof. Consider an execution $E_1$ in which the initial
value of R is $v_1$, and no messages are delivered (all
messages are lost, which is admissible for partitionable
links). In $E_1$, p first performs an operation $\mathit{write}(p, v_2)$
where $v_2 \neq v_1$. This operation must terminate in finite
time due to the termination property of A.

After the write operation terminates, q performs an
operation $\mathit{read}(q)$, which must return $v_1$, since there is
no way for q to know the value $v_2$, due to all messages
being lost. The read must also terminate. This execution
is not linearizable, because the read did not return $v_2$.

Now consider an execution $E_2$ which extends $E_1$ as
follows: after the termination of the $\mathit{read}(q)$ operation,
every message that was sent during $E_1$ is delivered (this
is admissible for partitionable links). These deliveries
cannot affect the execution of the write and read
operations, since they occur after the termination of both
operations, so $E_2$ is also non-linearizable. Moreover, $E_2$
is loss-free, since every message was delivered. □
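
The construction in the proof can be replayed for one concrete
terminating algorithm. The sketch below is an illustration only:
the Replica class is a hypothetical algorithm in which every
operation completes using local state, whereas the proof itself
applies to every terminating algorithm. The names v1 and v2 follow
the proof; everything else is an assumption made for the example.

```python
class Replica:
    """A process holding a local copy of register R; it never blocks."""

    def __init__(self, initial_value):
        self.value = initial_value
        self.outbox = []  # messages sent but not yet delivered

    def write(self, value):
        self.value = value
        self.outbox.append(value)  # propagation is attempted, not awaited
        return "ok"                # terminates without any reply

    def read(self):
        return self.value          # terminates using local state only

    def deliver(self, value):
        self.value = value


def run_execution():
    p, q = Replica("v1"), Replica("v1")
    history = []
    # First part: p writes v2, and no message reaches q.
    p.write("v2")
    history.append(("write", "v2"))
    # q reads after the write has terminated and still sees v1.
    history.append(("read", q.read()))
    # Extension: the delayed messages are delivered after both
    # operations have terminated, so the execution is loss-free.
    for value in p.outbox:
        q.deliver(value)
    p.outbox.clear()
    return history, q.read()


history, later_read = run_execution()
print(history)     # [('write', 'v2'), ('read', 'v1')]
print(later_read)  # v2
```

The recorded history contains a stale read even though every
message was eventually delivered, which is the non-linearizable,
loss-free execution that the assertion requires.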

Corollary 2. There is no algorithm that implements
a terminating read-write register in the partitionable
model that is linearizable in all loss-free executions.

This corresponds to Gilbert and Lynch’s Corollary 1,
and follows directly from the existence of a loss-free,
non-linearizable execution (assertion 1).

Assertion 3. There is no algorithm that implements
a terminating read-write register in the partitionable
model that is eventually consistent in all executions.

Proof. Consider an execution E in which no messages
are delivered, and in which process p performs
operation $\mathit{write}(p, v)$. This write must terminate, since the
algorithm is terminating. Process q with $p \neq q$ performs
$\mathit{read}(q)$ infinitely many times. However, since no
messages are delivered, q can never learn about the value
written, so $\mathit{read}(q)$ never returns v. Thus, E is not
eventually consistent. □

3.2.3 Opportunistic properties 

Note that corollary 2 is about loss-free executions,
whereas assertion 3 is about all executions. If we limit
ourselves to loss-free executions, then eventual
consistency is possible (e.g. by maintaining a replica at each
process, and broadcasting every write to all processes).
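
The replicate-and-broadcast idea can be sketched as follows. The
Process and Network classes are hypothetical, and the sketch
assumes a loss-free execution in which every broadcast message is
eventually delivered. A read returns the set of all values the
process knows about, matching the read-set convention of section
3.2.1.

```python
class Network:
    """Loss-free network: queued messages are delivered eventually."""

    def __init__(self):
        self.processes = []
        self.in_flight = []  # (recipient, value) pairs awaiting delivery

    def broadcast(self, sender, value):
        for process in self.processes:
            if process is not sender:
                self.in_flight.append((process, value))

    def deliver_all(self):
        while self.in_flight:
            recipient, value = self.in_flight.pop(0)
            recipient.deliver(value)


class Process:
    def __init__(self, network):
        self.values = set()  # local replica: every value seen so far
        self.network = network
        network.processes.append(self)

    def write(self, value):
        self.values.add(value)
        self.network.broadcast(self, value)  # does not wait for delivery

    def read(self):
        return set(self.values)

    def deliver(self, value):
        self.values.add(value)


net = Network()
p, q, r = Process(net), Process(net), Process(net)
p.write("a")
q.write("b")
before = r.read()   # nothing has been delivered to r yet
net.deliver_all()
after = [proc.read() for proc in (p, q, r)]
print(before)       # set()
print(after[2] == {"a", "b"})  # True
```

Reads may be incomplete while messages are in flight, but once
delivery has caught up every replica returns the same set.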

However, everything we have discussed in this sec¬ 
tion pertains to the partitionable model, in which we 
cannot assume that all executions are loss-free. For clar¬ 
ity, we should specify the properties of an algorithm 
such that they hold for all admissible executions of a 
given system model, not only selected executions. 

To this end, we can transform a property $\mathcal{P}$ into an
opportunistic property $\mathcal{P}'$ such that:

$$\forall E : (E \models \mathcal{P}') \iff (\mathit{lossfree}(E) \Rightarrow (E \models \mathcal{P}))$$

or, equivalently:

$$\forall E : (E \models \mathcal{P}') \iff (\mathit{partitioned}(E) \lor (E \models \mathcal{P})).$$

In other words, $\mathcal{P}'$ is trivially satisfied for executions
that are partitioned. Requiring $\mathcal{P}'$ to hold for all executions
is equivalent to requiring $\mathcal{P}$ to hold for all loss-free
executions.
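The opportunistic transformation described above can be sketched as a higher-order function over predicates on executions. This is only an illustration: the `Execution` record and its fields are hypothetical stand-ins for a real execution trace, and `partitioned` is assumed to be the negation of `loss_free`, as the equivalence of the two formulas implies.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Execution:
    """Toy summary of an execution (hypothetical fields, for illustration)."""
    messages_sent: int
    messages_delivered: int
    eventually_consistent: bool


Property = Callable[[Execution], bool]


def loss_free(e: Execution) -> bool:
    # Loss-free: every message that was sent has been delivered.
    return e.messages_delivered == e.messages_sent


def partitioned(e: Execution) -> bool:
    # Assumption: an execution is partitioned iff it is not loss-free.
    return not loss_free(e)


def opportunistic(p: Property) -> Property:
    # P' holds for E iff E is partitioned or P holds for E.
    return lambda e: partitioned(e) or p(e)


def eventually_consistent(e: Execution) -> bool:
    return e.eventually_consistent


opportunistically_ec = opportunistic(eventually_consistent)
```

An execution in which messages are lost satisfies `opportunistically_ec` trivially, whereas a loss-free execution satisfies it only if it is actually eventually consistent.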

Hence we define an execution E as opportunistically
eventually consistent if E is partitioned or if E is eventually
consistent. (This is a weaker liveness property than
eventual consistency.)

Similarly, we define an execution E as opportunistically
terminating linearizable if E is partitioned, or if
E is linearizable and every operation in E terminates in
finite time.

From the results above, we can see that opportunistic
terminating linearizability is impossible in the partitionable
model (corollary 2), whereas opportunistic eventual
consistency is possible. This distinction can be understood
as the key result of CAP. However, it is arguably
not a very interesting or insightful result.


3.3 Mismatch between formal model and practical systems

Many of the problems in this section are due to the
fact that availability is defined by Gilbert and Lynch as
a liveness property (section 2.1.2). Liveness properties
make statements about something happening eventually
in an infinite execution, which is confusing to practitioners,
since real systems need to get things done in a
finite (and usually short) amount of time.

Quoting Lamport: “Liveness properties are inherently
problematic. The question of whether a real
system satisfies a liveness property is meaningless; it
can be answered only by observing the system for an
infinite length of time, and real systems don’t run forever.
Liveness is always an approximation to the property
we really care about. We want a program to terminate
within 100 years, but proving that it does would require
the addition of distracting timing assumptions. So,
we prove the weaker condition that the program eventually
terminates. This doesn’t prove that the program will
terminate within our lifetimes, but it does demonstrate
the absence of infinite loops.”

Brewer and some commercial database vendors
state that “all three properties [consistency,
availability, and partition tolerance] are more continuous
than binary”. This is in direct contradiction to
Gilbert and Lynch’s formalization of CAP (and our
restatement thereof), which expresses consistency and
availability as safety and liveness properties of an algorithm,
and partitions as a property of the system model.
Such properties either hold or they do not hold; there is
no degree of continuity in their definition.

Brewer’s informal interpretation of CAP is intuitively 
appealing, but it is not a theorem, since it is not ex¬ 
pressed formally (and thus cannot be proved or dis¬ 
proved) - it is, at best, a rule of thumb. Gilbert and 
Lynch’s formalization can be proved correct, but it does 
not correspond to practitioners’ intuitions for real sys¬ 
tems. This contradiction suggests that although the for¬ 
mal model may be true, it is not useful. 


4 Alternatives to CAP 

In section 2 we explored the definitions of the terms
consistency, availability and partition tolerance, and
noted that a wide range of ambiguous and mutually
incompatible interpretations have been proposed, leading
to widespread confusion. Then, in section 3 we explored
Gilbert and Lynch’s definitions and proofs in
more detail, and highlighted some problems with the
formalization of CAP.

All of these misunderstandings and ambiguities lead
us to assert that CAP is no longer an appropriate
tool for reasoning about systems. A better framework
for describing trade-offs is required. Such a framework
should be simple to understand, match most people’s intuitions,
and use definitions that are formal and correct.

In the rest of this paper we develop a first draft of
an alternative framework called delay-sensitivity, which
provides tools for reasoning about trade-offs between
consistency and robustness to network faults. It is based
on several existing results from the distributed systems
literature (most of which in fact predate CAP).

4.1 Latency and availability 

As discussed in section 2.1.2, the latency (response
time) of operations is often important in practice, but
it is deliberately ignored by Gilbert and Lynch.

The problem with latency is that it is more difficult 
to model. Latency is influenced by many factors, espe¬ 
cially the delay of packets on the network. Many com¬ 
puter networks (including Ethernet and the Internet) do 
not guarantee bounded delay, i.e. they allow packets 
to be delayed arbitrarily. Latencies and network delays 
are therefore typically described as probability distribu¬ 
tions. 

On the other hand, network delay can model a wide 
range of faults. In network protocols that automatically 
retransmit lost packets (such as TCP), transient packet 
loss manifests itself to the application as temporarily 
increased delay. Even when the period of packet loss 
exceeds the TCP connection timeout, application-level 
protocols often retry failed requests until they succeed, 
so the effective latency of the operation is the time from 
the first attempt until the successful completion. Even 
network partitions can be modelled as large packet de¬ 
lays (up to the duration of the partition), provided that 
the duration of the partition is finite and lost packets are 
retransmitted an unbounded number of times. 
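To illustrate how retried requests turn packet loss into latency, the following minimal sketch computes the effective latency of an operation as the time from the first attempt until the first success. The function name, the fixed per-attempt timeout and the millisecond units are assumptions made for illustration; real clients typically add backoff between attempts.

```python
from typing import Iterable, Optional


def effective_latency_ms(
    attempt_succeeds: Iterable[bool],
    timeout_ms: int,
    success_rtt_ms: int,
) -> Optional[int]:
    """Time from the first attempt until the first successful completion.

    attempt_succeeds: one boolean per attempt, in order (True = got a reply).
    timeout_ms:       time wasted on each failed attempt before retrying.
    success_rtt_ms:   round-trip time of the attempt that finally succeeds.
    Returns None if no attempt in the sequence succeeds.
    """
    elapsed = 0
    for ok in attempt_succeeds:
        if ok:
            return elapsed + success_rtt_ms
        elapsed += timeout_ms  # a failed attempt only shows up as added delay
    return None


# Two attempts are lost during a fault, the third one gets through.
during_fault = effective_latency_ms([False, False, True], 1000, 50)
no_fault = effective_latency_ms([True], 1000, 50)
```

From the application's point of view the fault never appears as an error, only as an operation that took `during_fault` milliseconds instead of `no_fault`.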

Abadi [2] argues that there is a trade-off between
consistency and latency, which applies even when there
is no network partition, and which is as important as the
consistency/availability trade-off described by CAP. He
proposes a “PACELC” formulation to reason about this
trade-off.

We go further, and assert that availability should be 
modeled in terms of operation latency. For example, we 
could define the availability of a service as the propor¬ 
tion of requests that meet some latency bound (e.g. re¬ 
turning successfully within 500 ms, as defined by an 
SLA). This empirically-founded definition of availabil¬ 
ity closely matches our intuitive understanding. 
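As a rough sketch of this definition, availability can be computed from a request log as the fraction of requests that both succeed and finish within the latency bound. The log format (pairs of latency and success flag) and the choice of an inclusive bound are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def availability(
    requests: Sequence[Tuple[float, bool]],
    latency_bound_ms: float,
) -> float:
    """Proportion of requests that succeed within the latency bound.

    requests: (latency_ms, succeeded) pairs observed over some period.
    """
    if not requests:
        raise ValueError("availability is undefined for an empty request log")
    good = sum(
        1
        for latency_ms, succeeded in requests
        if succeeded and latency_ms <= latency_bound_ms
    )
    return good / len(requests)


log = [
    (120.0, True),   # fast and successful: counts as available
    (480.0, True),   # within the 500 ms bound: counts
    (900.0, True),   # successful but too slow: does not count
    (50.0, False),   # fast error response: does not count
]
measured = availability(log, latency_bound_ms=500.0)
```

Note that a slow success and a fast error are treated alike: both count against availability.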

We can then reason about a service’s tolerance of net¬ 
work problems by analyzing how operation latency is 
affected by changes in network delay, and whether this 
pushes operation latency over the limit set by the SLA. 
If a service can sustain low operation latency, even as 
network delay increases dramatically, it is more toler¬ 
ant of network problems than a service whose latency 
increases. 

4.2 How operation latency depends on network delay

To find a replacement for CAP with a latency-centric 
viewpoint we need to examine how operation latency is 
affected by network latency at different levels of con¬ 
sistency. In practice, this depends on the algorithms and 
implementation of the particular software being used. 
However, CAP demonstrated that there is also interest 
in theoretical results identifying the fundamental lim¬ 
its of what can be achieved, regardless of the particular 
algorithm in use. 

Several existing impossibility results establish lower
bounds on the operation latency as a function of the network
delay d. These results show that any algorithm
guaranteeing a particular level of consistency cannot
perform operations faster than some lower bound. We
summarize these results in table 1 and in the following
sections.

Our notation is similar to that used in complexity the¬ 
ory to describe the running time of an algorithm. How¬ 
ever, rather than being a function of the size of input, 
we describe the latency of an operation as a function of 
network delay. 

In this section we assume unbounded network delay,
and unsynchronized clocks (i.e. each process has access
to a clock that progresses monotonically at a rate approximately
equal to real time, but the synchronization
error between clocks is unbounded).

Consistency level        Write latency   Read latency
linearizability          O(d)            O(d)
sequential consistency   O(d)            O(1)
causal consistency       O(1)            O(1)

Table 1: Lowest possible operation latency at various
consistency levels, as a function of network delay d.

4.2.1 Linearizability 

Attiya and Welch show that any algorithm implementing
a linearizable read-write register must have an
operation latency of at least u/2, where u is the uncertainty
of delay in the network between replicas.

In this proof, network delay is assumed to be at most
d and at least d - u, so u is the difference between
the minimum and maximum network delay. In many
networks, the maximum possible delay (due to network
congestion or retransmitting lost packets) is much
greater than the minimum possible delay (due to the
speed of light), so u ≈ d. If network delay is unbounded,
operation latency is also unbounded.

For the purposes of this survey, we can simplify the
result to say that linearizability requires the latency of
read and write operations to be proportional to the network
delay d. This is indicated in table 1 as O(d) latency
for reads and writes. We call these operations
delay-sensitive, as their latency is sensitive to changes
in network delay.

4.2.2 Sequential consistency 

Lipton and Sandberg show that any algorithm implementing
a sequentially consistent read-write register
must have |r| + |w| ≥ d, where |r| is the latency of a
read operation, |w| is the latency of a write operation,
and d is the network delay. Mavronicolas and Roth
further develop this result.

This lower bound provides a degree of choice for the
application: for example, an application that performs
more reads than writes can reduce the average operation
latency by choosing |r| = 0 and |w| ≥ d, whereas
a write-heavy application might choose |r| ≥ d and
|w| = 0. Attiya and Welch describe algorithms for
both of these cases (the |r| = 0 case is similar to the
Zab algorithm used by Apache ZooKeeper).

Footnote (section 4.2.1): Attiya and Welch originally proved a bound of u/2 for write
operations (assuming two writer processes and one reader), and a
bound of u/4 for read operations (two readers, one writer). The u/2
bound for read operations is due to Mavronicolas and Roth.

Choosing |r| = 0 or |w| = 0 means the operation can
complete without waiting for any network communication
(it may still send messages, but need not wait for
a response from other nodes). The latency of such an
operation thus only depends on the local database algorithms;
it might be constant-time O(1), or it might be
O(log n) where n is the size of the database, but either
way it is independent of the network delay d, so we call
it delay-independent.

In table 1, sequential consistency is described as having
fast reads and slow writes (constant-time reads, and
write latency proportional to network delay), although
these roles can be swapped if an application prefers fast
writes and slow reads.
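The effect of this choice on average latency can be shown with a small calculation. The sketch below is illustrative only: the workload counts and the delay value are made-up numbers, the helper names are hypothetical, and local processing time is treated as zero so that only the network-dependent part of the latency is counted.

```python
def satisfies_bound(read_latency: int, write_latency: int, d: int) -> bool:
    # Lower bound for sequential consistency: read + write latency >= d.
    return read_latency + write_latency >= d


def mean_latency(reads: int, writes: int,
                 read_latency: int, write_latency: int) -> float:
    # Average latency over a workload with the given operation counts.
    total = reads * read_latency + writes * write_latency
    return total / (reads + writes)


d = 100                  # network delay, in ms (made-up value)
reads, writes = 9, 1     # a read-heavy workload (made-up mix)

# Option 1: delay-independent reads, delay-sensitive writes.
fast_reads = mean_latency(reads, writes, read_latency=0, write_latency=d)
# Option 2: delay-sensitive reads, delay-independent writes.
fast_writes = mean_latency(reads, writes, read_latency=d, write_latency=0)
```

Both options sit exactly on the lower bound, but for this read-heavy mix the first averages 10 ms per operation and the second 90 ms.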


4.2.3 Causal consistency 


If sequential consistency allows the latency of some operations
to be independent of network delay, which level
of consistency allows all operation latencies to be independent
of the network? Recent results show that
causal consistency with eventual convergence is the
strongest possible consistency guarantee with this property.

Read Your Writes, PRAM and other weak
consistency models (all the way down to eventual consistency,
which provides no safety property) are
weaker than causal consistency, and thus achievable
without waiting for the network.


If tolerance of network delay is the only consideration,
causal consistency is the optimal consistency level.
There may be other reasons for choosing weaker consistency
levels (for example, the metadata overhead of
tracking causality), but these trade-offs are outside
the scope of this discussion, as they are also outside
the scope of CAP.


Footnote (section 4.2.3): There are a few variants of causal consistency, such as real time
causal [43], causal+ and observable causal consistency.
They have subtle differences, but we do not have space in this paper
to compare them in detail.


4.3 Heterogeneous delays 

A limitation of the results in section 4.2 is that they assume
the distribution of network delays is the same between
every pair of nodes. This assumption is not true
in general; for example, network delay between nodes
in the same datacenter is likely to be much lower than
between geographically distributed nodes communicating
over WAN links.

If we model network faults as periods of increased
network delay (section 4.1), then a network partition
is a situation in which the delay between nodes within
each partition remains small, while the delay across partitions
increases dramatically (up to the duration of the
partition).

For O(d) algorithms, which of these different delays
do we need to assume for d? The answer depends on
the communication pattern of the algorithm.


4.3.1 Modeling network topology 


For example, a replication algorithm that uses a sin¬ 
gle leader or primary node requires all write requests 
to contact the primary, and thus d in this case is the net¬ 
work delay between the client and the leader (possibly 
via other nodes). In a geographically distributed system, 
if client and leader are in different locations, d includes 
WAN links. If the client is temporarily partitioned from 
the leader, d increases up to the duration of the partition. 

By contrast, the ABD algorithm [3] waits for responses
from a majority of replicas, so d is the largest
delay among the majority of replicas that are fastest to respond.
If a minority of replicas is temporarily partitioned
from the client, the operation latency remains independent
of the partition duration.
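The difference between the two communication patterns can be made concrete with a short sketch. It is a simplification under stated assumptions: each function returns the network delay d that bounds one round of communication, constant factors such as the number of rounds are ignored, and the delay values (in ms, with a one-hour partition) are invented for illustration.

```python
from typing import Sequence


def leader_delay(delay_to_leader: float) -> float:
    # Single-leader replication: d is the delay between client and leader.
    return delay_to_leader


def majority_quorum_delay(replica_delays: Sequence[float]) -> float:
    # Majority quorum: d is the delay of the slowest replica among the
    # fastest majority, i.e. the (n // 2 + 1)-th smallest delay.
    majority = len(replica_delays) // 2 + 1
    return sorted(replica_delays)[majority - 1]


PARTITION_MS = 3_600_000.0  # replicas cut off for an hour (made-up value)

# Five replicas, two of them partitioned from the client.
minority_cut_off = [5.0, 7.0, 9.0, PARTITION_MS, PARTITION_MS]
# Five replicas, three of them partitioned from the client.
majority_cut_off = [5.0, 7.0, PARTITION_MS, PARTITION_MS, PARTITION_MS]

d_minority = majority_quorum_delay(minority_cut_off)
d_majority = majority_quorum_delay(majority_cut_off)
d_leader_cut_off = leader_delay(PARTITION_MS)
```

With only a minority cut off, the quorum delay stays at the delay of the third-fastest replica; once a majority is unreachable, or the single leader is, d grows to the duration of the partition.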

Another possibility is to treat network delay within
the same datacenter, d_local, differently from network
delay over WAN links, d_remote, because usually
d_local ≪ d_remote. Systems such as COPS [43],
which place a leader in each datacenter, provide linearizable
operations within one datacenter (requiring
O(d_local) latency), and causal consistency across datacenters
(making the request latency independent of
d_remote).


4.4 Delay-independent operations 

The big-O notation for operation latency ignores constant
factors (such as the number of network round-trips
required by an algorithm), but it captures the essence of
what we need to know for building systems that can
tolerate network faults: what happens if network delay
dramatically degrades? In a delay-sensitive O(d) algorithm,
operation latency may increase to be as large as
the duration of the network interruption (i.e. minutes or
even hours), whereas a delay-independent O(1) algorithm
remains unaffected.

If the SLA calls for operation latencies that are significantly
shorter than the expected duration of network interruptions,
delay-independent algorithms are required.
In such algorithms, the time until replica convergence
is still proportional to d, but convergence is decoupled
from operation latency. Put another way, delay-independent
algorithms support disconnected or offline
operation. Disconnected operation has long been used
in network file systems [37] and automatic teller machines.

For example, consider a calendar application running 
on a mobile device: a user may travel through a tunnel 
or to a remote location where there is no cellular net¬ 
work coverage. For a mobile device, regular network 
interruptions are expected, and they may last for days. 
During this time, the user should still be able to inter¬ 
act with the calendar app, checking their schedule and 
adding events (with any changes asynchronously prop¬ 
agated when an internet connection is next available). 
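A minimal sketch of such a delay-independent calendar is given below. It is not taken from any particular system: the class, its method names and the `send` callback are hypothetical, and conflict resolution between replicas is left out entirely. The point is only that reads and writes touch local state, while propagation happens separately whenever the network allows.

```python
from collections import deque
from typing import Callable, Deque, List


class OfflineCalendar:
    """Local replica with an outbox of changes awaiting propagation."""

    def __init__(self) -> None:
        self.events: List[str] = []        # local copy, always readable
        self.outbox: Deque[str] = deque()  # changes not yet propagated

    def add_event(self, title: str) -> None:
        # Completes locally; never waits for the network.
        self.events.append(title)
        self.outbox.append(title)

    def list_events(self) -> List[str]:
        # Served from the local replica; never waits for the network.
        return list(self.events)

    def sync(self, send: Callable[[str], bool]) -> int:
        # Try to propagate queued changes in order. `send` returns True if
        # the change was delivered. Stops at the first failure and keeps
        # the remaining changes queued for the next attempt.
        delivered = 0
        while self.outbox:
            if not send(self.outbox[0]):
                break
            self.outbox.popleft()
            delivered += 1
        return delivered


calendar = OfflineCalendar()
calendar.add_event("Dentist, Tuesday 10:00")
sent_while_offline = calendar.sync(lambda change: False)  # no connectivity
server_log: List[str] = []
sent_when_online = calendar.sync(lambda change: server_log.append(change) is None)
```

Here `add_event` and `list_events` are delay-independent, while the time until the server sees the change depends on when `sync` next succeeds.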

However, even in environments with fast and reliable
network connectivity, delay-independent algorithms
have been shown to have performance and scalability
benefits: in this context, they are known as
coordination-free or ALPS systems. Many
popular database integrity constraints can be implemented
without synchronous coordination between
replicas.

4.5 Proposed terminology 

Much of the confusion around CAP is due to the ambiguous,
counter-intuitive and contradictory definitions
of terms such as availability, as discussed in section 2.
In order to improve the situation and reduce misunderstandings,
there is a need to standardize terminology
with simple, formal and correct definitions that match
the intuitions of practitioners.

Building upon the observations above and the results
cited in section 4.2, we propose the following definitions
as a first draft of a delay-sensitivity framework for
reasoning about consistency and availability trade-offs.
These definitions are informal and intended as a starting
point for further discussion.


Availability is an empirical metric, not a property of 
an algorithm. It is defined as the percentage of 
successful requests (returning a non-error response 
within a predefined latency bound) over some pe¬ 
riod of system operation. 

Delay-sensitive describes algorithms or operations 
that need to wait for network communication to 
complete, i.e. which have latency proportional to 
network delay. The opposite is delay-independent. 
Systems must specify the nature of the sen¬ 
sitivity (e.g. an operation may be sensitive to 
intra-datacenter delay but independent of inter¬ 
datacenter delay). A fully delay-independent sys¬ 
tem supports disconnected (offline) operation. 

Network faults encompass packet loss (both transient 
and long-lasting) and unusually large packet delay. 
Network partitions are just one particular type of 
network fault; in most cases, systems should plan 
for all kinds of network fault, and not only parti¬ 
tions. As long as lost packets or failed requests are 
retried, they can be modeled as large network de¬ 
lay. 

Fault tolerance is used in preference to high availability
or partition tolerance. The maximum fault that
can be tolerated must be specified (e.g. “the algorithm
can tolerate up to a minority of replicas
crashing or disconnecting”), and the description
must also state what happens if more faults occur
than the system can tolerate (e.g. all requests return
an error, or a consistency property is violated).

Consistency refers to a spectrum of different con¬ 
sistency models (including linearizability and 
causal consistency), not one particular consistency 
model. When a particular consistency model such 
as linearizability is intended, it is referred to by its 
usual name. The term strong consistency is vague, 
and may refer to linearizability, sequential consis¬ 
tency or one-copy serializability. 

5 Conclusion 

In this paper we discussed several problems with the 
CAP theorem; the definitions of consistency, availabil¬ 
ity and partition tolerance in the literature are somewhat 
contradictory and counter-intuitive, and the distinction 
that CAP draws between “strong” and “eventual” con¬ 
sistency models is less clear than widely believed. 

CAP has nevertheless been very influential in the
design of distributed data systems. It deserves credit
for catalyzing the exploration of the design space of
systems with weak consistency guarantees, e.g. in the
NoSQL movement. However, we believe that CAP has
now reached the end of its usefulness; we recommend
that it should be relegated to the history of distributed
systems, and no longer be used for justifying design decisions.

As an alternative to CAP, we propose a simple delay-sensitivity
framework for reasoning about trade-offs between
consistency guarantees and tolerance of network
faults in a replicated database. Every operation is categorized
as either O(d), if its latency is sensitive to
network delay, or O(1), if it is independent of network
delay. On the assumption that lost messages are
retransmitted an unbounded number of times, we can
model network faults (including partitions) as periods
of greatly increased delay. The algorithm’s sensitivity to
network delay determines whether the system can still
meet its service level agreement (SLA) when a network
fault occurs.

The actual sensitivity of a system to network delay
depends on its implementation, but - in keeping
with the goal of CAP - we can prove that certain levels
of consistency cannot be achieved without making
operation latency proportional to network delay. These
theoretical lower bounds are summarized in Table 1.
We have not proved any new results in this paper, but
merely drawn on existing distributed systems research
dating mostly from the 1990s (and thus predating CAP).

For future work, it would be interesting to model the
probability distribution of latencies for different concurrency
control and replication algorithms (e.g. by extending
PBS), rather than modeling network delay
as just a single number d. It would also be interesting
to model the network communication topology of distributed
algorithms more explicitly.

We hope that by being more rigorous about the impli¬ 
cations of different consistency levels on performance 
and fault tolerance, we can encourage designers of dis¬ 
tributed data systems to continue the exploration of the 
design space. And we also hope that by adopting sim¬ 
ple, correct and intuitive terminology, we can help guide 
application developers towards the storage technologies 
that are most appropriate for their use cases. 